DOCUMENT RESUME

ED 098 626                                        CS 500 840

AUTHOR        Richards, William D., Jr.
TITLE         Network Analysis in Large Complex Systems: Techniques
              and Methods -- Tools.
PUB DATE      Apr 74
NOTE          40p.; Paper presented at the Annual Meeting of the
              International Communication Association (New Orleans,
              Louisiana, April 17-20, 1974)

EDRS PRICE    MF-$0.75 HC-$1.85 PLUS POSTAGE
DESCRIPTORS   Analog Computers; *Computer Programs; Higher
              Education; Information Processing; *Networks;
              *Organizational Communication; *Program Descriptions;
              *Systems Analysis

ABSTRACT 

Divided into five major sections, this paper describes a new
algorithm which has been implemented in an extended FORTRAN program
which runs on a CDC 6500 computer. The first section of the paper
briefly outlines the goals of network analysis and presents the
context in which these goals must be met. Section 2 describes the
algorithm and the rationale behind it. In section 3 some especially
important programming considerations are described, and section 4
covers some general characteristics of running the program. The
final section of the paper briefly describes the historical
development of this algorithm. (RB)

TECHNIQUES AND METHODS--TOOLS

In 1971 a formal algorithm for analyzing communication networks
in large complex organizations was presented. (1) Since that time
there have been many advances in the area, some conceptual (2) and
others operational. This paper will describe a new algorithm which
has been implemented in an Extended FORTRAN program which runs on a
CDC 6500 computer. (3) This algorithm could be realized on any large
general purpose machine, and it far surpasses any other similar
analytic technique we are aware of in terms of utility, capacity,
and efficiency.

This paper will not discuss the theoretical basis of network
analysis, nor will it report any empirical findings. For coverage
of these areas, the reader is urged to see (2) and (4). Because the
computer program mentioned above is highly complex and
system-dependent, the actual code will not be presented. It should
be possible, however, to write a similar program for any given
machine with the information that will be presented here.

The paper will be divided into five major sections. The first
will briefly outline the goals of network analysis and present the
context in which these goals must be met. The second will describe
the actual algorithm and the rationale behind it. In section three
some especially important programming considerations are described;
section four covers some general characteristics of the running
program. Finally, the last section will describe briefly the
historical development of this algorithm. Throughout the major
portion of the first three sections, the approach taken in the
description of techniques will be to discuss the goals and
constraints of each facet of the analysis, and then to show that the
methods used are able to meet the goals efficiently, given the
constraints.

Because much of the technique of network analysis depends
on the theory of network analysis, it is difficult to discuss analytic
methods without referring to the theoretical basis, which is covered
extensively in (2). Rather than repeating discussions which are
given better treatment in (2), the symbol "**" will be used to
indicate that a point covered in this text is discussed more fully
in (2).

Part One -- Network Analysis: The Problem

The goals of network analysis are to a) detect and b) describe
any structuring at each of three levels of the communication network.
(The nature of structuring in complex systems is covered at length
in (5) and will not be discussed here.) These three levels are the
individual, the group, and the whole system. The detection of
structure is a straightforward statistical problem which is easily
done. If there is any structuring, it is then possible to describe
it, and to analyze various characteristics of the system at
different levels of analysis. The kinds of characteristics that can
be examined will also not be covered in this paper; they are
described in (6).

The basic problem we are thus faced with is this: given some
particular network which we know to be structured, we must determine
what the units of analysis are at the various levels of analysis.
At the individual and whole-system levels this doesn't seem to be
much of a problem. At intermediate levels, however, the determination
of the component boundaries is a complex problem. This is the problem
area upon which our attention is focused in this paper.

What is meant by "the determination of component boundaries
at intermediate levels"? Basically this: if the system as a whole
is structured, it will show differentiation into parts. (5,7) These
parts will take the form of groups of individuals which will meet
certain specified criteria. We would like to know who is in these
groups. This, then, is our main goal.

What are the constraints under which we must meet this goal?
They center on the nature of the data we have available. The data
usually take the form of lists of links between pairs of nodes.
There is no limit on how many links any individual may have; i.e.,
the number of links may vary from one individual to the next, and
there is no set number. All we know about each link is who it
connects and how strong it is. (We may also know if it is directed
or not.)

A major consideration is the size of the network. We would like
to be able to study very large systems: we want a maximum capability
of at least a thousand nodes, and it might be nice to be able to look
at networks with several thousand nodes. In networks we have
observed, the number of links per node has ranged from zero to a
maximum average of about twenty. If we allow a limit of 1000 nodes,
we should have room enough for at least twenty times that number of
links, that is, 20,000 links. This is a lot of data!

Network data is not like most data we are used to seeing in the
social sciences. The data elements do not describe properties of
individuals. Rather, they describe properties of relationships
between individuals. What we have, then, is a topological problem.
We cannot approach this problem with the Euclidean distance paradigm
used to structure other kinds of data. We cannot, therefore, hope to
use the mathematical tools which produce unique exact solutions based
on a distance model. Instead, we use heuristic pattern-recognition
techniques which result in topological representations which may
then be characterized along a number of dimensions.

The problem then is to arrange and represent the data in such a
way that it becomes possible to see the groups. We do not use
traditional scaling approaches because our data do not fit those
models. Instead we use less elegant pattern-recognition techniques.
These techniques are described in the next part of the paper.



Part Two -- The Algorithm

In any searching procedure it is necessary that the investigator
have a good idea of what it is that he is looking for. This common
sense notion cannot lightly be dismissed, especially if we wish to
write a computer program to do the searching for us. Computers,
like chickens, are monumentally stupid. Unlike chickens, however,
computers will do what we want them to do if we tell them exactly
what to do and how to do it. This means that we must know exactly
what we are looking for. This demand for precision led to the
somewhat complex definition of groups and other network roles which
is presented here.**

I. Nodes may be of two types: participants and non-participants.
Non-participants are either not connected to the rest of the
network or are only minimally connected. They include:

A. Isolate type one. These nodes have no links of any kind.

B. Isolate type two. These nodes have one link.

C. Isolated dyad. These nodes have a single link between
themselves.

D. Tree node. These nodes have a single link to a participant,
and have some number of other isolates attached to them.






II. Participants are nodes that have two or more links to other
participant nodes. They make up the bulk of the network in
most cases, and allow for the development of structure. They
include:

A. Group member. A node with more than some percentage of his
linkage with other members of the same group. (This percentage
is called the alpha-percent or α-percent.)

B. Liaison. These nodes fail to meet the α-percent criterion
with members of any single group, but do meet it for members
of groups in general.

C. Type other. These nodes fail to meet the α-percent
criterion for any set of group members.

III. To be called a group, a set of nodes must satisfy these five
criteria.

A. There must be at least three members.

B. Each must meet the α-criterion with the other members of
this group.

C. There must be some path, lying entirely within the group,
from each member to each other member. (This is called the
connectiveness criterion.)

D. There may be no single node (or arbitrarily small set of
nodes) which, when removed from the group, causes the rest
of the group to fail to meet any of the above criteria.
(This is called the critical node criterion.)

E. There must be no single link (or subset of links) which,
if cut, causes the group to fail to meet any of the above
criteria. (This is called the critical link criterion.)
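
The participant roles in II can be illustrated with a short sketch.
This is not the program described in this paper (which is written in
Extended FORTRAN and is not reproduced); it is a minimal Python
illustration under stated assumptions: "linkage" is measured as summed
link strength, α is given as a fraction with 0.5 chosen arbitrarily,
and the names classify_participant, links and group_of are
hypothetical.

```python
def classify_participant(node, links, group_of, alpha=0.5):
    """Classify one participant node as member, liaison, or type other.

    links:    dict node -> dict neighbor -> link strength
    group_of: dict node -> group label (nodes not in a group are absent)
    alpha:    the alpha-percent criterion as a fraction (assumed value)
    """
    total = sum(links[node].values())
    # Sum this node's linkage to the members of each group.
    per_group = {}
    for neighbor, strength in links[node].items():
        group = group_of.get(neighbor)
        if group is not None:
            per_group[group] = per_group.get(group, 0.0) + strength
    if total > 0 and per_group:
        best = max(per_group, key=per_group.get)
        # Group member: criterion met with one single group.
        if per_group[best] / total > alpha:
            return ("member", best)
        # Liaison: criterion met only with group members in general.
        if sum(per_group.values()) / total > alpha:
            return ("liaison", None)
    return ("other", None)


# Two three-member groups joined through node "l".
links = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "l": 1},
    "d": {"e": 1, "f": 1, "l": 1},
    "e": {"d": 1, "f": 1},
    "f": {"d": 1, "e": 1},
    "l": {"c": 1, "d": 1},
}
group_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
role_of_l = classify_participant("l", links, group_of)
role_of_a = classify_participant("a", links, group_of)
```

Here node "l" splits its linkage evenly between the two groups, so it
fails the criterion for either group alone but meets it for group
members in general.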

Obviously, if we think a certain set of nodes might be a group,
we can be sure by applying tests to see if the set meets the five
group criteria. If there are slight errors, we can adjust the
group boundary by adding or removing nodes to or from the group.
In other words, if we can get an approximate answer to the question
of group boundaries, we can "clean it up" and make it exact by
the application of the criteria.

It turns out that we can take advantage of this fact and make
significant savings because of it. This is because it is easier
to make an educated guess about the group structure and then adjust
this to an exact solution than it is to begin right away with a
search for the exact solution. The algorithm to be presented in
this paper follows this two-staged approach. In the first stage,
the data are rearranged in such a way that an approximate solution
is readily obtained. In the second, that tentative solution is
tested and cleaned up so that it becomes exact enough to satisfy
the criteria.

The Approximate Solution 

There are three stages to this part of the analysis. In the
first stage the non-participants are identified and removed for the
rest of the analysis. This is done because the presence of
non-participants only serves to complicate things by increasing the
number and variety of nodes in the analysis. The non-participants
are easily identified by their patterns of interaction with other
nodes.

In the second and third stages of this part of the analysis
the data are re-arranged and tentatively partitioned into parts
which will correspond roughly to the final group structure.

Re-arranging the Data to Make the Groups Visible 

This part of the analysis is essentially a refined version of
the algorithm that was presented in 1971. (1) What is being done
can be understood easily with the following analogy. Imagine
the nodes to be like billiard balls scattered about in space.
Imagine there to be rubber bands connecting the balls corresponding
to nodes with links between them. Imagine there to be springs
between balls corresponding to nodes that do not have links between
them. The rubber bands will act to pull the balls connected to
each other closer to each other, while the springs will push the
balls not connected to each other apart from each other. If we
hook up the rubber bands and springs and release the balls, they
will re-arrange themselves so that the balls corresponding to nodes
with links to each other will be close to each other, while the
balls corresponding to nodes that are not linked to each other will
be pushed away from each other. This example is shown in Figure 1.

We could refine this technique by using heavier rubber bands
to represent the links that occur more often or are more important.
Since our objective here is to make it easier to identify groups,
we could make the process work even better if we could make the
rubber bands for within-group links heavier than the ones for other
kinds of links. In order to do this, we will need some indicator
that tells us which links look like within-group links.

If two nodes are in the same group, they are likely to have
many links to the same people. There is likely to be a high number
of shared links, or two-step links, between this pair of nodes. If
they are not in the same group, they are not likely to talk to the
same people, and there are not likely to be many two-step links
between the nodes. Thus, the number of two-step links is used as
an indicator of the probability that the link is a within-group
link.
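
The two-step indicator can be sketched in a few lines. The following
Python fragment is an illustration only, assuming undirected links
stored as neighbor sets; the function name two_step_counts is
hypothetical and the paper does not specify how the count is turned
into a weight.

```python
def two_step_counts(neighbors):
    """For every linked pair (a, b), count the nodes linked to both.

    neighbors: dict node -> set of nodes it is linked to (undirected).
    Returns a dict keyed by (a, b) with a < b.
    """
    counts = {}
    for a, linked in neighbors.items():
        for b in linked:
            if a < b:
                # Shared neighbors are the two-step paths from a to b.
                counts[(a, b)] = len(neighbors[a] & neighbors[b])
    return counts


# Two triangles (1, 2, 3) and (4, 5, 6) joined by the link 3-4.
neighbors = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
shared = two_step_counts(neighbors)
```

In this small network every within-group link has one two-step path
behind it, while the link between nodes 3 and 4, which joins the two
groups, has none.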
Figure 1

This figure illustrates the billiard ball and rubber band
model described in the text. The network shown has two groups
of three nodes each. The three drawings represent three successive
increments of time, as the nodes move farther and farther in
response to the forces exerted by the rubber bands.

The original position of the balls is shown by the shaded
circles in the top drawing. Movement of balls during each time
increment is shown by the dotted arrows in the three drawings.
The scale was changed in going from the first to the second to
the third drawing, in order to show smaller and smaller regions
in space as occupying the same sized area in the drawings. The
region of the top drawing shown in the middle one is indicated
by the dotted box in the top. Similarly, the area of the bottom
drawing is shown by the dotted box in the middle one.

Now, it is hard to represent large numbers of points in
multi-dimensional space. It takes a lot of information to do
this, and it is fairly difficult to move objects in this kind of
a space. Extensive experimentation with real data, however,
showed that it was not necessary to use a multi-dimensional
representation for this analysis; a single line segment was
sufficient. This kind of reduction in complexity of representation
greatly reduced the amount of information needed to perform
the analysis at the same time it made the analysis itself easier
to do.

The analysis is performed as follows: Nodes are scattered
at unit points along a line segment N units long, where N is the
number of nodes. We then treat each link from, say, node A to
node B, as a vector, starting at A and pointing at B. We take
all the vectors for each person and compute the average, weighting
the individual vectors for strength of the link and probability
that the link is a within-group link. We then get a single point
for each individual, that point being the mean of that person's
vectors. This is illustrated in Figure 2. After all the means
have been computed, each node is moved to the point indicated
by his mean.

After this process has been completed, nodes with links to
each other will be closer to each other than they were before.
They will not, however, be as close as they could be. This fact
is due to the way nodes are scattered initially, and also to
the statistical properties of the mean. For this reason, the
entire process is repeated, using the new locations instead of the
original positions used for the first set of calculations. A
plot showing how the nodes moved in successive iterations is
shown in the bottom half of Figure 2. Between each set of
calculations it is necessary to expand the scale of the continuum
so that the spread or range which is occupied by the nodes
remains N units long. If this is not done, the points will
move closer and closer to each other, finally collapsing on a
single spot. This is the "scale expansion" referred to in Figure
2.

Figure 2

At the top of this figure is shown a hypothetical network
consisting of two groups, each of which has three members.

The diagram in the middle shows how the six nodes are
initially placed along a line segment. The two solid
arrows pointing to the right in the top of this figure are
the vectors representing the links of Node #1 to Node #2 and
Node #6. The dashed arrow between the solid ones is the
average of the two. Below the line segment are shown the vectors
for the links of Node #6.

The diagram on the bottom of Figure 2 shows how the iterative
process of vector averaging works. The first line shows the
initial positions of the six nodes. The second shows what the
means could look like. Moving from the second to the third lines,
the scale has been expanded so that the nodes range over the
entire length of the continuum. The fourth and sixth lines
show the second and third sets of means, while the expanded
versions are shown on the fifth and seventh lines. (Note that
the values shown are not the actual values that would be obtained
for this particular network; they are intended merely to
illustrate how the process might typically look.)

The formula used for calculating a person's mean is shown
here:

\[
M^{*} = \frac{\sum_{i=1}^{l} wf_i \, s_i \, M_i}{\sum_{i=1}^{l} wf_i \, s_i}
\]

where $wf_i$ is the two-step weighting factor described above;
$s_i$ is a ratio-level indicator of the strength of the link; and
$M_i$ is the old mean of the person to whom the link goes. The
summation is done as $i$ goes from 1 to $l$, where $l$ is the number
of links that the individual whose mean we are calculating has.

In the development of this algorithm different numbers of
iterations; different ways of varying relative contributions of
wf_i's, s_i's, and M_i's; and different ways of assigning the original
M_i's were tried. In general, four to six iterations seemed to be
sufficient for any data set that was examined. If nodes are given
subject numbers running from 1 to N, where N is the number of nodes,
and these subject numbers are used as the first approximation for
the M_i's, the process seems to work well for all types of data.
In actual tests, when different subject numbers were assigned to
individuals, the solution obtained was identical to the first
solution, which indicates that the process is not terribly sensitive
to the original positions. Usually, the wf_i's and s_i's are given
equal weight, although this has not been tested extensively.

The result of the application of this process is a continuum,
N units long, with a scattering of nodes along its length. A
sample network, together with the continuum that might result, is
shown in Figure 3. This continuum is used as the input to the
next stage of the analysis, in which tentative boundaries for
groups are drawn.
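
The iterative averaging and scale expansion can be sketched as
follows. This Python fragment is an illustration under assumptions,
not the FORTRAN program: the within-group weight is taken to be one
plus the number of two-step links (so that no link receives a zero
weight), all strengths are 1, positions are rescaled to the range 1
to N after each pass, and the names iterate_means and rescale are
hypothetical.

```python
def rescale(positions, n):
    """Scale expansion: stretch the positions to span 1..n again."""
    lo, hi = min(positions.values()), max(positions.values())
    if hi == lo:
        return dict(positions)
    return {k: 1 + (v - lo) * (n - 1) / (hi - lo)
            for k, v in positions.items()}


def iterate_means(links, iterations=5):
    """links: dict node -> dict neighbor -> strength (undirected)."""
    nodes = sorted(links)
    n = len(nodes)
    # Subject numbers 1..N serve as the first approximation.
    pos = {node: float(i + 1) for i, node in enumerate(nodes)}
    for _ in range(iterations):
        new = {}
        for node in nodes:
            num = den = 0.0
            for other, strength in links[node].items():
                shared = len(set(links[node]) & set(links[other]))
                weight = (1 + shared) * strength  # assumed weighting
                num += weight * pos[other]
                den += weight
            # A node without links keeps its old position.
            new[node] = num / den if den else pos[node]
        pos = rescale(new, n)
    return pos


# Two triangles (1, 2, 3) and (4, 5, 6) joined by the link 3-4.
links = {
    1: {2: 1, 3: 1}, 2: {1: 1, 3: 1}, 3: {1: 1, 2: 1, 4: 1},
    4: {3: 1, 5: 1, 6: 1}, 5: {4: 1, 6: 1}, 6: {4: 1, 5: 1},
}
final = iterate_means(links)
```

After the default five passes the members of each triangle lie close
together at opposite ends of the continuum, with a wide gap between
the two clusters.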

Drawing the Tentative Boundaries 

For any human observer, even a casual glance at Figure 3
will be enough to suggest that there are three clusters of nodes.
The computer, however, must be told what a cluster looks like,
and how to look for one. People probably identify a cluster as
an area in which there are a lot of nodes, surrounded by areas
in which there are fewer nodes. This is essentially what we have
the machine look for.

We will need a plot of the "density" of nodes along the
continuum. In order to get such a plot, we construct a "window" and
move it along the continuum, counting the number of nodes visible
through the window at each point. This is shown at the top of
Figure 4. Extensive testing has led to the conclusion that the
most efficient way to proceed is to center the window over each
node, rather than to "slide" it gradually down the continuum.
The optimum size of the window, also determined by experimentation,
appears to be about two units on an N unit line. Windows smaller
than this introduce spurious statistical information, while with
windows larger than this, group boundaries tend to blur and merge
into indistinction. This is shown in Figure 4, where density plots
appear for windows of varying widths. The result of moving the
window down the continuum will be a list of densities, with one
value for each individual. Such a list could be represented as
a bar plot like the one shown in Figure 4.

Figure 3

The top of this figure shows a hypothetical network
composed of twenty nodes. Group boundaries are indicated
by the dashed lines.

The bottom shows what the final continuum might look
like for the network shown in the top. Again, the group
boundaries have been indicated by dashed lines.

Figure 4

This figure shows how the density plot is made. The
example uses the continuum shown in Figure 3. In the top part,
the window is shown, centered successively on the first eight
nodes.

The three bar graphs in the middle show the effects of
differently sized windows.

On the bottom is shown the refined version of the plot, with
numbers of nodes visible to the right of the center of the
window plotted above the horizontal and numbers visible on the
left of the window plotted below the horizontal.

With this representation, groups will look like mounds, with
boundaries between groups being indicated by low points. Although
it seems as though this representation would be adequate, there
arose problems which led to an improvement over this simple plot.
Although the problems will not be discussed here, the improvement
will: instead of just counting the number of nodes visible through
the window, two numbers are counted, the number visible on the
right half of the window and the number visible on the left half.
When constructing the bar graph, the number visible on the right
half is plotted above the horizontal, while the number visible on
the left half is plotted below the horizontal. The result is
shown at the bottom of Figure 4.
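
The refined density count can be sketched briefly. The Python below
is an illustration only; it assumes the node at the center of the
window is not itself counted, that a node lying exactly at the center
position counts on neither side, and the name window_density is
hypothetical.

```python
def window_density(positions, width=2.0):
    """Center a window on each node and count nodes on each side.

    positions: list of node positions on the continuum.
    Returns one (left_count, right_count) pair per node, in
    continuum order.
    """
    points = sorted(positions)
    half = width / 2.0
    densities = []
    for center in points:
        left = sum(1 for p in points if center - half <= p < center)
        right = sum(1 for p in points if center < p <= center + half)
        densities.append((left, right))
    return densities


# Two tight clusters of three nodes on a six-unit continuum.
density = window_density([1.0, 1.2, 1.5, 5.0, 5.3, 5.4])
```

For the first node of each cluster all of its neighbors are visible
on the right; for the last node of each cluster they are all visible
on the left, which is the asymmetry the refined plot exploits.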

The final step in this stage is to have the computer draw
lines around the groups. The way this is done is by locating
spots at which there is a large change as we move from one point
on the continuum to the next. If we count the number of
non-overlapping points and divide by the number of overlapping
points for each pair of adjacent nodes on the final bar plot, we
will have a fairly sensitive indicator of group continuity. This is
shown in Figure 5. High values for this ratio will indicate that
there is a large change as we move from one node to the next. Low
values, on the other hand, will indicate that there is only a small
change. If we choose a cutting point, and instruct the computer to
draw a line whenever the ratio goes above the cutting point, we will
have told the computer how to draw the boundaries around groups. If
the value of the cutting point is varied, we can alter the
sensitivity of the group spotting routine in either direction. With
a window of two units, a cutting point of 1.0 appears to be optimum
for most networks. Different values, along with the results, are
shown in Figure 5.

Figure 5

This figure illustrates the boundary-drawing process.
The density plot on the bottom of Figure 4 is shown on the
top of this figure. The table below the plot shows the
number of overlapping points, the number of non-overlapping
points, and the ratio of the two numbers, for each successive
pair of bars on the bar plot.

The ratios are plotted in the graph in the middle of the
page. The three dotted lines show the three different cutting
points.

Below the ratio plot, the original continuum is shown three
times. The first shows the effect of a high cutting point,
while the second and third ones show the results for moderate
and low values of the cutting point.
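
One possible reading of the boundary-drawing rule is sketched below.
The text does not define "overlapping" and "non-overlapping" points
precisely, so this Python fragment rests on an assumption: for two
adjacent bars, the overlap is the part of the upper and lower bar
heights that the two bars share, and the non-overlap is the part in
which they differ. The name draw_boundaries is hypothetical.

```python
def draw_boundaries(densities, cutting_point=1.0):
    """Return indices i where a boundary falls between bar i and i+1.

    densities: list of (left_count, right_count) pairs, one per node,
    in continuum order.
    """
    boundaries = []
    for i in range(len(densities) - 1):
        left_a, right_a = densities[i]
        left_b, right_b = densities[i + 1]
        # Assumed definition: shared height versus differing height.
        overlap = min(left_a, left_b) + min(right_a, right_b)
        non_overlap = abs(left_a - left_b) + abs(right_a - right_b)
        if overlap == 0:
            ratio = float("inf") if non_overlap else 0.0
        else:
            ratio = non_overlap / overlap
        if ratio > cutting_point:
            boundaries.append(i)
    return boundaries


# Densities for two clusters of five closely spaced nodes each.
bars = [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0),
        (0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
cuts = draw_boundaries(bars)
```

Under this reading, adjacent bars inside a cluster give a ratio of
2/3, below the cutting point of 1.0, while the pair straddling the
two clusters shares nothing and is marked as a boundary.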

This concludes the approximate phase of the analysis. The 
result of this stage is a list of tentative groups of nodes. The 
next part of the analysis involves the testing of this tentative 
solution, and any alteration that may have to be done to "clean it 
up." 

Using the Criteria for an Exact Solution 

This part of the analysis can be divided into two parts. In the
first, individual nodes are tested to see if they meet the relevant
criteria for their role in the network. If they do not, the
appropriate changes are made. In the second, whole groups are tested
for the criteria that are relevant at that level. Again, appropriate
changes are made if necessary. We begin with the individual testing,
which is very simple.

Individual Testing 

First, people not in groups are tested to see if they meet the
α-criterion for either liaison or group membership in any group. If
any individual does meet the criterion, he is reclassified on that
basis. If the individual fails both tests, he is labelled as "type
other."

Second, members of groups are tested to see if they meet
the α-criterion for group membership. Again, if the criterion
is not met the appropriate changes are made.

Because changes made at any point in time can affect the
roles of other people who were tested earlier, the tests are
applied twice, to make sure that the final classification will
be consistent with itself.

Group Testing 

In this section we change our level of analysis to whole
groups, rather than separate individuals. The criteria to be
tested in this part are the connectiveness and critical link/node
criteria. Since the information generated in the testing of the
connectiveness criterion is necessary in the testing of the
other two, it will be covered first.

The basic device used in the testing of these criteria is 
the distance matrix, which is constructed for each group. In 
this n by n matrix (n is the number of members in the group), 
the element in row i, column j gives the number of steps needed 
to get from individual i to individual j in the group. If there 
is some finite number in each element of the matrix, the group 
will be connected. This means that there will be some path 
from each individual in the group to every other individual in 
the group. The longest any path could ever be is n-1 steps. A 
sample network, together with its distance matrix is shown in 
Figure 6. 

The way the distance matrix is constructed is as follows.
A matrix is constructed in which there is a row and a column
for each node in the group. All the elements are initialized to
zero. Whenever there is a link from node i to node j we enter
a 1 in row i, column j. If the link is reciprocated we also enter
a 1 in row j, column i.**

Figure 6

At the top of Figure 6 is shown a hypothetical eight-node
network. The matrix directly below the network is a binary
version of the network. In this matrix, each node has a row
and a column. The i, j entry of the matrix is 1 if node i is
linked to node j.

The second matrix is the distance matrix for the same
network. The entry in the i, j element of the matrix is the
number of links in the shortest path from node i to node j.

We then repeatedly perform a Boolean logic operation which is
analogous to raising the matrix to successively higher and higher
powers. Instead of entering the cross product of the ith row and
the jth column as the i, j element in the product matrix, however,
we enter the first power on which this value becomes non-zero.
(This operation is performed with a series of nested DO-loops and
IF statements in FORTRAN. With careful organization, the process
can be optimized to take significantly less time to compute than
a standard algebraic multiplication of matrices.)

We stop raising the matrix to higher powers when one of two
conditions obtains: either a) all off-diagonal elements become
non-zero, which implies the group is connected; or b) when going
from any power k to the next power k+1 no entries change value,
which implies the group is not connected at level k and will
never be connected at any level.
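
The construction and the two stopping conditions can be sketched as
follows. This Python fragment is an illustration of the idea, not a
transcription of the FORTRAN routine; it uses plain nested loops
rather than any optimized Boolean operation, records 0 for pairs with
no path found, and the name distance_matrix is hypothetical.

```python
def distance_matrix(adj):
    """adj: n x n list of 0/1 entries; adj[i][j] = 1 for a link i -> j.

    Returns (dist, connected). dist[i][j] is the first "power" at
    which j becomes reachable from i, or 0 if it never does.
    """
    n = len(adj)
    dist = [[adj[i][j] if i != j else 0 for j in range(n)]
            for i in range(n)]
    k = 1
    while True:
        # Condition a): every off-diagonal element is non-zero.
        if all(dist[i][j] for i in range(n) for j in range(n) if i != j):
            return dist, True
        changed = False
        new = [row[:] for row in dist]
        for i in range(n):
            for j in range(n):
                if i != j and dist[i][j] == 0:
                    # j is k+1 steps away if some m at distance k
                    # from i has a link to j.
                    if any(dist[i][m] == k and adj[m][j]
                           for m in range(n)):
                        new[i][j] = k + 1
                        changed = True
        dist = new
        # Condition b): nothing changed, so nothing ever will.
        if not changed:
            return dist, False
        k += 1


chain, chain_connected = distance_matrix(
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]])
split, split_connected = distance_matrix(
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]])
```

The first call describes a three-node chain, which is connected with
a longest path of two steps; the second leaves the third node with no
links, so the loop stops as soon as a pass changes nothing.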

If the group is not connected, it is split into a connected
part and all the rest. Each of the two parts is then treated as
a separate group, and subjected to all the tests that any group
must undergo.

At this point, there are only the critical links/nodes criteria
remaining to be tested. These criteria serve as checks against
situations like those shown in the bottom half of Figure 7, where
two groups have been mistakenly identified as one. This situation
is generalized to include situations in which there are any number
of multiple groups, connected in some relatively minimal way,
which we wish to separate into distinct groups. The occurrence
of these confusions is a result of the inelegance of the approximate
techniques used in the first half of the analysis. For analytic
purposes, it is practical to combine these two criteria into a
single rule which says that no subset of some arbitrary size
may be removed from a group and cause the group to become
disconnected. If there is such a subset, the group will be seen to
be "really" two or more groups.** As a result of this combination,
whenever two groups are joined by a bridge link (a link between
members of different groups), one of the nodes of this link will be
identified as a liaison. That node will later be tested for the
α-criterion of group membership, and if he passes, will be returned
to his group.

The problem has thus been reduced to one of identifying any
critical nodes which may exist in a group. If there is one, he
will be the node with the lowest average distance from all other
nodes. This is because all paths from nodes in either half of
the group to the other half must go through the critical node.
The average distance from any node to all the other nodes is
given by the average of all the entries in that node's row in the
distance matrix. This is illustrated in Figure 7. If there is
a set of critical nodes, they will be the nodes with the smallest
row means.

The fact that critical nodes have lower row means than the other
members suggests that there must be some variation in the row
means if there are any critical nodes. We can take advantage of
this fact if we only look for critical nodes when there is some
variance. It turns out that this leads to a large saving in terms
of computation time. This is because of the way we test for
critical nodes.

Figure 7

On the upper left-hand corner of this figure is shown a
hypothetical nine-member network. To the right of this is the
distance matrix for that network. The rightmost column of the matrix
contains the means of the rows of the matrix. The values in this
column are thus the mean number of steps it takes that node to reach
all other nodes. The overall mean for the group, together with
the standard deviation of the distribution of means, is shown below
the matrix.

The network in the bottom left-hand corner is an example of
the kind of situation that occurs when two or more groups are
identified as a single group. Clearly, Node #5 is a liaison between
the two groups. The middle matrix on the right half of the page
is the distance matrix for this group. Note the relatively high
standard deviation for this group, compared to the one above it.

The third matrix was constructed after removing Node #5.
Note that there are no values for many of the elements, indicating
that the group is no longer connected. The means shown for this
bottom matrix are the values that would be obtained if the group
were split in two, and the means for each group calculated separately.

To check a node to see if it is critical, we remove it from the
group and re-calculate the distance matrix. If, as a result of
the removal, the group becomes disconnected, we have found a critical
node. If the group is still connected, we try the next candidate:
the node who, of all the remaining nodes, has the smallest row mean.
We will usually stop this process after taking out some percentage
of the original group (usually ten percent) if the group continues
to remain connected. If this happens, we put all the removed nodes
back into the group.
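
The removal test can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code of the program: the function names, the `percent` parameter, and both example networks are hypothetical, and the group is assumed to be connected when the search begins.

```python
from collections import deque

def build_group(links):
    """Turn a list of undirected links into a node -> neighbour-set mapping."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def distances_from(adj, source, removed):
    """Breadth-first distances from source, ignoring removed nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in adj[current]:
            if neighbour not in removed and neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist

def find_critical_nodes(adj, percent=10):
    """Remove lowest-row-mean nodes one at a time until the group splits.

    Returns the removed nodes if the group becomes disconnected, or an empty
    list if the removal budget is spent first (all nodes are then put back).
    """
    budget = max(1, (len(adj) * percent + 99) // 100)   # percent, rounded up
    removed = set()
    while len(removed) < budget:
        remaining = [n for n in adj if n not in removed]
        if len(remaining) < 3:
            break   # a group of two or fewer cannot be split by a removal
        # Re-calculate the row means on the group as it currently stands.
        means = {}
        for node in remaining:
            dist = distances_from(adj, node, removed)
            means[node] = sum(dist.values()) / (len(remaining) - 1)
        candidate = min(remaining, key=means.get)
        removed.add(candidate)
        survivors = [n for n in remaining if n != candidate]
        reached = distances_from(adj, survivors[0], removed)
        if len(reached) < len(survivors):
            return sorted(removed)   # the group fell apart
    return []

# Hypothetical group with a liaison: two cliques joined only through node 5.
bridged = build_group([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
                       (4, 5), (5, 6),
                       (6, 7), (6, 8), (6, 9), (7, 8), (7, 9), (8, 9)])
# Hypothetical group without one: a ring of nine nodes.
ring = build_group([(i, i % 9 + 1) for i in range(1, 10)])

print(find_critical_nodes(bridged))   # prints: [5]
print(find_critical_nodes(ring))      # prints: []
```

Each pass of the loop repeats a full set of breadth-first searches, which is why the test is expensive.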

It is easy to see that there is a lot of work involved in
searching for critical nodes. This is why the heuristic device
of checking the variance of the row means is so important. In every
network that has been examined so far, this heuristic has worked
correctly. That is, it did not prevent any critical nodes from
being found. Similarly, the approach of looking at nodes with
the lowest row means always seems to find the critical nodes.
The optimum value to use as a cutting point for the variance test
seems to be about 0.3. Whenever the standard deviation of the row
means exceeds this value, there is likely to be a critical node.
Whenever the standard deviation is less than this value, there is
not.
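
As a small worked example of the variance test, the sketch below applies the 0.3 cutting point to two hypothetical sets of row means. It is written in Python for illustration only. Two details are assumptions not settled by the text: the population form of the standard deviation is used, and each row mean is taken over the other eight nodes. Averaging over all nine entries of the row, diagonal included, would shrink every value by a factor of 8/9 and could move a borderline group to the other side of the cutoff.

```python
from statistics import pstdev

CUTOFF = 0.3   # cutting point reported in the text

def needs_critical_node_search(row_means, cutoff=CUTOFF):
    """Variance heuristic: search only if the row means are spread out."""
    return pstdev(row_means) > cutoff

# Hypothetical nine-node group made of two four-member cliques joined
# only through one node (row means listed for nodes 1 through 9).
bridged = [2.5, 2.5, 2.5, 1.875, 1.75, 1.875, 2.5, 2.5, 2.5]

# Hypothetical nine-node ring, in which every node is equally central.
ring = [2.5] * 9

print(needs_critical_node_search(bridged))   # prints: True
print(needs_critical_node_search(ring))      # prints: False
```

The standard deviation for the bridged group is about 0.316, just over the cutting point, so the search would be run. For the ring it is zero and the search would be skipped.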

After all groups have passed these tests, the obtained
classification of nodes to groups and other roles will be exact. At this
point various indices may be calculated and the results tabled in
any convenient manner. A flow chart of the algorithm is shown in
Figure 8.

Part Three--Programming Considerations

This section will discuss several aspects of the analysis
technique that are relevant to the actual coding of a program
to perform the analysis. Some of these involve programming
approaches which make the program both more powerful and easier
to write, while others include programming "tricks" that greatly
increase the efficiency of the program.

Programming Approaches--General Considerations

FORTRAN seems to be a good language to use for this type
of program. The logical structure of FORTRAN is sufficiently
powerful to handle the logic of the analysis, and the relative
efficiency of FORTRAN makes the large amount of arithmetic
affordable. In addition, FORTRAN is widely available, easily
learned, and easily written. Finally, the only operating version
of the program was written in FORTRAN, and it would seem to be
easier to do another program in the same language, rather than
a different one.

The other major general consideration involves internal
data representation. The data should be stored in the form
of variable length lists, rather than matrices. The use of
the matrix format will greatly limit the capacity of the program
and will make it prohibitively expensive to run, although it will
greatly simplify the programmer's task. To utilize a list processing
approach in FORTRAN is not difficult. A set of standard statement
functions will handle most routine list-processing procedures
with ease.
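
One common way to hold variable length lists in flat arrays is sketched below. This is an illustration in Python under stated assumptions, not a description of the program's actual storage scheme. The names `start`, `neighbours`, `build_lists`, and `neighbours_of` are hypothetical, nodes are numbered from zero, and each link is stored in both directions.

```python
def build_lists(n_nodes, links):
    """Pack an undirected link list into two flat arrays.

    Entries start[i] up to start[i + 1] - 1 of `neighbours` belong to
    node i, so each node's list can have a different length.
    """
    degree = [0] * n_nodes
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    start = [0] * (n_nodes + 1)
    for i in range(n_nodes):
        start[i + 1] = start[i] + degree[i]
    neighbours = [0] * start[n_nodes]
    fill = start[:n_nodes]   # next free slot for each node (a copy)
    for a, b in links:
        neighbours[fill[a]] = b
        fill[a] += 1
        neighbours[fill[b]] = a
        fill[b] += 1
    return start, neighbours

def neighbours_of(node, start, neighbours):
    """Return the list of nodes linked to `node`."""
    return neighbours[start[node]:start[node + 1]]

links = [(0, 1), (0, 2), (1, 2), (2, 3)]
start, neighbours = build_lists(4, links)
print(neighbours_of(2, start, neighbours))   # prints: [0, 1, 3]

# Storage at the capacity quoted in Part Four (4,095 nodes, 32,767 links).
n_nodes, n_links = 4095, 32767
matrix_cells = n_nodes * n_nodes           # one cell per pair of nodes
list_cells = (n_nodes + 1) + 2 * n_links   # pointer array plus both link ends
print(matrix_cells, list_cells)            # prints: 16769025 69630
```

On these figures a full matrix would need roughly 240 times as many cells as the list form, which is the sense in which the matrix format limits capacity.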

Careful organization of logical steps is crucial if an
efficient program is to be written. This is especially true in
an algorithm as complex as this one, where huge numbers of
decisions must be made about vast amounts of information. The
programmer should expend considerable amounts of time deciding how
to organize both the data he is working on and the operations
he is performing on the data. In general, everything should
be modularized, standardized, and clearly organized. This means:

a) The algorithm should be broken down into logical steps.

b) All notation (i.e. variable names, statement numbers,
logical organization, etc.) must be consistent throughout
the program.

c) Extreme care must be taken to clearly organize the code.
That is, code should be "clean", concise, and "nice". If
a section "feels clumsy" there is probably a better way
to do it, and it is worth looking for a better way.

Documentation is essential. It is probably not possible to 
write a program for this algorithm without extensive documentation. 
This means flow charts, verbal descriptions of logic flows and 
objectives, memory use maps, and comment cards are all essential. 

Programming "Tricks"--Optimization

There is a lot of room to optimize in this kind of program.
Careful attention to organization at all levels of analysis will
reveal parallel logic processes which could be handled by the
same section of code, for example. There are many points
at which a single recursive loop of code, although more
difficult to write than a sequential series of instructions,
will perform an operation faster and more efficiently.

All list searching can be optimized if the lists are ordered
in such a way as to minimize mean search time. The use of
indirect addressing in most of these kinds of situations can
make many search operations almost automatic.
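
The text does not say which ordering or addressing scheme was used, so the following Python sketch shows only one possible reading. It assumes that each list is kept in ascending order, so that a binary search can replace a scan, and that "indirect addressing" means using one table entry as the index into another. All names and data are hypothetical.

```python
from bisect import bisect_left

def has_link(sorted_neighbours, target):
    """Binary search in a neighbour list kept in ascending order."""
    position = bisect_left(sorted_neighbours, target)
    return (position < len(sorted_neighbours)
            and sorted_neighbours[position] == target)

# Indirect addressing: group_of[node] gives a group number, which is then
# used directly as an index into a second table. No search is needed.
group_of = [0, 0, 1, 1, 2]   # hypothetical assignment of five nodes
group_sizes = [2, 2, 1]      # one entry per group

def group_size_of(node):
    """Size of the group that `node` belongs to, found by two table lookups."""
    return group_sizes[group_of[node]]

print(has_link([2, 5, 9, 14], 9))   # prints: True
print(has_link([2, 5, 9, 14], 3))   # prints: False
print(group_size_of(4))             # prints: 1
```

With ordered lists the number of comparisons per search grows with the logarithm of the list length rather than with the length itself.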

Finally, the use of a CDC machine (or one similar to it) 
with its more powerful version of Extended FORTRAN, rather 
than an IBM with standard FORTRAN G or H, will make significant 
differences in the ease with which the program is written, 
as well as the actual cost of running the final program. 

Because different computer installations have different
ways of doing certain things, it is not possible to be more
specific with any of these comments.⁸

Part Four--General Characteristics of the Network Analysis Program

The only implementation of the algorithm, as of this date,
is an Extended FORTRAN program which operates on the CDC 6500
at Michigan State University. The code for the entire program,
including data cleaning routines and complex table-producing
routines, occupies approximately 2,300 computer cards. The
program is stored in the form of a compiled version of the
most recent revision, eliminating the compilation cost each time
the program is run.

The code alone occupies 31,000₈ 60-bit words of core,
leaving about 130,000₈ words for data. This gives a capacity
of 4,095 nodes and 32,767 links. Execution time increases
linearly as a function of network size and complexity. Some
times appear below.

Number of Links    Number of Nodes    Execution Time
2000               725                57 seconds
1000               270                68 seconds

Execution time is a function of network complexity, as
well as the absolute size, as can be seen above. The printout
describes the network in great detail, and print charges
usually exceed compute charges by a factor of about four.
The printout for a network of 1,000 nodes is three to five
inches thick.

The program is well protected against many user errors
and odd data configurations. It has been tested on random
data, and it performed as expected. (9) The largest network
ever run was with an N of 960. Estimates suggest that to do
this analysis by hand would take ten tireless errorless men
over a century. It took the computer less than two minutes.

Part Five--Historical Development

Work was begun in this area at MSU in late 1970. The
first working program, operationalizing parts of the algorithm
described in (1), was completed in late 1971. The output of
that program was a large matrix, with rows and columns arranged
so that people who talked to each other were close to each other
in the matrix. In June of 1972 the analytic methods used were
refined and extended to include the group detection routines.
These were further refined and extensively tested in Summer and
Fall of 1973.

In the time from 1972 to the present, the program was
continuously being improved, as better and better methods of
problem solving were discovered. In addition, errors of various
types were tracked down and fixed. The theoretical basis on
which the program stands was being refined at the same time
improvements were being made on the program.

At this point, we know of no significant errors in
the program itself, although users of the program sometimes
make errors when running their data through it. Attempts
are being made to include routines in the program which will
identify even these kinds of error, thus protecting the user
from himself.

1. William D. Richards is a graduate student at Stanford's
Institute for Communication Research.

2. In the past, some investigators have limited the number of
links per node to some constant, like three or four. There
is no reason to do this, as it severely distorts the data.

3. Many of the most interesting properties of networks don't
seem to be found in small simple systems. We have seen
several moderately large networks which show qualitatively
different properties than smaller ones. Perhaps very large
networks will be different in similar ways.

4. This analogy may have been suggested by James A. Danowski,
a colleague of mine who has provided much invaluable assistance
throughout the development of this methodology.

5. A major problem with the examples used to illustrate the
different parts of the algorithm is that they are too simple
to accurately mirror the kinds of things that happen with real
data. This simplicity was felt to be necessary, if the
examples were to be clear.
The size of the window is given a good information theoretical
treatment in a paper by Gauthier. (tt)

6. With real networks, the bar plot is much more complex and rich
in detail. There are usually more nodes in each group, and
the shapes of the groups in the bar plot are strikingly
different from the ones seen in Figure 4. However, the ones
shown there do illustrate the concept being put forth.

7. The author was not able to devise a proof for this theorem,
so he tested it empirically with a large number of examples.
It never received any disconfirming evidence, and if a T-test
were done on the results, the significance level would be
p less than 0.00001. He is therefore confident in the
truth of the theorem.

8. The author suggests that he is not primarily a computer programmer
and regrets that his research interests prevent him from
learning the peculiarities of other computer systems, which
would allow him to be of more assistance to others working
in the area. He is, however, willing to discuss any problems
that may be encountered in attempting to program the algorithm.



Acknowledgements

Professor Vincent Farace offered valuable moral assistance
over the course of many months of difficult programming. James
Danowski provided valuable assistance with the day to day aspects
of the task. The Department of Communication at MSU funded the
work, and my own department at Stanford allowed me the freedom to
work in a strange area at another place for long periods of time.
People who I didn't mention here but also deserve thanks know who
they are, and I thank them. Finally, thanks to Control Data
Corporation, for developing a fine computer.